Abstract

Similar to variable selection in the linear regression model, selecting significant
components in the popular additive regression model is of great interest. However, such
components are unknown smooth functions of independent variables, which are
unobservable. As such, some approximation is needed. In this paper, we suggest a
combination of penalized regression spline approximation and group variable selection, called
the lasso-type spline method (LSM), to handle this component selection problem with
a diverging number of strongly correlated variables in each group. It is shown that
the proposed method can select significant components and estimate nonparametric
additive function components simultaneously with an optimal convergence rate.
To make the LSM stable in computation and able to adapt its estimators to
the level of smoothness of the component functions, weighted power spline bases and
projected weighted power spline bases are proposed. Their performance is examined by
simulation studies across two set-ups with independent predictors and correlated
predictors, respectively, and appears superior to the performance of competing methods.
The proposed method is extended to a partial linear regression model analysis with real
data, and gives reliable results.

Keywords: Additive model, nonparametric component, group variable selection, penalized 
splines, lasso, generalized cross-validation. 
1. Introduction



Consider the additive regression model

$$Y_i = \alpha + \sum_{k=1}^{K} f_k(X_{ki}) + \varepsilon_i, \tag{1.1}$$

where $X_{ki}$ are the components of $X_i = (X_{1i}, \ldots, X_{Ki})$, $E f_k(X_{ki}) = 0$, $\{f_k(\cdot), k = 1, \ldots, K\}$
are unknown smooth functions, and $\{\varepsilon_i\}$ is a sequence of i.i.d. random variables with a mean
of 0 and a finite variance $\sigma^2$. This model was first proposed by Friedman and Stuetzle (1981),
and has become a popular multivariate nonparametric regression model in practice. Hastie 
and Tibshirani (1990) gave a comprehensive review of this model and showed that it could 
be widely used in multivariate nonparametric modeling. 

The additive model provides an approximation, with an additive structure, for mul- 
tivariate nonparametric regression. There are at least two benefits of such an additive 
approximation. First, as every single individual additive component can be estimated using 
a univariate smoother in an iterative manner, the so-called "curse of dimensionality" that 
besets multivariate nonparametric regression is largely avoided. Stone (1985, 1986)
theoretically confirmed this by showing that one can construct an estimator of $f$ that achieves
the same optimal convergence rate for a general value of $K$ as for $K = 1$. Second, the
estimate of each individual component explains how the dependent variable changes with
the corresponding independent variables; essentially, the simpler structure improves the
interpretability of the model.

There are several methods available in the literature for fitting the additive model.
These include the backfitting algorithm (Friedman and Stuetzle 1981; Buja, Hastie and
Tibshirani 1989; Opsomer and Ruppert 1998), the smooth backfitting algorithm (Mammen,
Linton and Nielsen 1999; Mammen and Park 2005; Nielsen and Sperlich 2005; Mammen
and Park 2006; Yu, Park and Mammen 2008), marginal integration estimation methods
(Tjøstheim and Auestad 1994; Linton and Nielsen 1995; Fan et al. 1998), the Fourier series
or wavelet approximation approach (Amato, Antoniadis and De Feis 2002; Amato and
Antoniadis 2001; Sardy and Tseng 2004), and the penalized B-splines method (Eilers and Marx
2002), among others.

To make the additive model more efficient, the search for a parsimonious version is
clearly of importance. Although estimation has been intensively investigated, insignificant
independent variables and function components increase the complexity of the model, which
leads to a great computational burden and numerical instability. Hence, deriving a method
for obtaining estimates in a parsimonious additive model that still achieve an optimal
convergence rate, as is the case with only one nonparametric component, is an interesting
issue.
We use a real data example to demonstrate why selecting significant components and
searching for a parsimonious additive model is of importance for statistical additive mod- 
eling. Fan and Peng (2004) used an additive model and penalized SCAD least-squares to 
analyze the employee dataset of the Fifth National Bank of Springfield based on data from 
1995 (see Example 11.3 in Albright et al. 1999). The bank, whose name has since changed, 
was charged in court with paying its female employees substantially smaller salaries than its 
male employees. For each of its 208 employees, the dataset includes the following variables. 

• EduLev: education level, a categorical variable with categories 1 (finished high school), 
2 (finished some college courses), 3 (obtained a bachelor's degree), 4 (took some 
graduate courses), 5 (obtained a graduate degree). 

• JobGrade: a categorical variable indicating the current job level, the possible levels 
being 1-6 (6 is the highest). 

• YrHired: the year that an employee was hired. 

• YrBorn: the year that an employee was born. 

• Gender: a categorical variable that takes the value "Female" or "Male". 

• YrsPrior: the number of years of work experience that employee had at another bank 
before working at the Fifth National Bank. 

• PCJob: a dummy variable that takes the value of 1 if the employee's current job is
computer-related, and 0 otherwise.

• Salary: the current (year 1995) annual salary in thousands of dollars. 

Based on the discussions of Lam and Fan (2008) and Zhang (2008), both "YrsExp" and
"Age" should have a nonlinear relationship with "Salary", and an additive model should be
an appropriate model to fit the data. With the nonparametric components of "Age" and
"YrsExp" both included in the model, the goodness-of-fit statistic is 0.8123. The nonlinearity
is informally confirmed by Figure 1, which presents the estimated curves of "YrsExp" and
"Age", respectively. However, the estimated function $\hat f_1(\text{YrsExp})$ is not an increasing
function. This is inconsistent with the general intuition that salary should increase with
"YrsExp". It is natural to explore the reasons behind this inconsistency. As we might suppose
that "Age" and "YrsExp" will be strongly correlated, we may naturally ask whether the
phenomenon regarding "YrsExp" is caused by inappropriately including insignificant
variables or components in the model. To demonstrate the necessity of component selection,
we manually remove one component to see what happens. That is, we consider two additive
models, each of which includes either "Age" or "YrsExp". We find that the model without
the "Age" component has a larger goodness-of-fit value (0.8144) than the model with both
"Age" and "YrsExp", and the model without the "YrsExp" component has a smaller value
of 0.8052. This indicates that we should keep "YrsExp" in the model. More importantly,
from Figure 2 we can see that when the "Age" component is removed, the estimated function
of "YrsExp" is an increasing function of "YrsExp", which fits the intuition. This also suggests
that when insignificant components are removed, the remaining components have better
explanatory power. As such, a means of automatically removing the "Age" component from
the model is important; the difficulty is that what must be removed is a nonparametric
component rather than an observable variable. We thus need a new method to handle this
modeling issue.

Figure 1: The estimated regression functions of "Age" and "YrsExp", respectively, when both are
included in the additive model.

Figure 2: The estimated regression functions of "Age" and "YrsExp", respectively, when only one
of them is included in the respective additive model.

1.1. Goals of the paper 

There have been some studies on variable selection in additive modeling. Smith and
Kohn (1996) proposed a Bayesian approach to select significant variables. Chen and Härdle
(1995) used a simple threshold method to select significant independent variables for the
additive model, in which the function components are estimated by the marginal integration
method. Shively, Kohn and Wood (1999) proposed a hierarchical Bayesian approach to
variable selection and function estimation that uses a data-driven prior, and estimated their
functions by model averaging. Lin and Zhang (2006) penalized the norm of the second-order
derivative of the component functions to obtain a sparse additive model in which the functions are
estimated by using a smoothing spline technique. Ravikumar et al. (2009) proposed a new
method to produce a sparse additive model based on the idea of group variable selection and
nonnegative garrote variable selection. All of these methods involve decoupled smoothing
and sparsity, and penalize the norm of the estimated additive component functions to
produce a sparse additive model. Most are based on classical variable selection methods
for linear models, and hence cannot select significant independent variables and estimate
the components simultaneously. The statistical properties of the estimates are also difficult
to analyze. Furthermore, these methods impose a large computational burden, especially
when there are many nonparametric function components to be estimated.

Recently, some attempts have been made to resolve these problems in additive
models (see, for example, Meier, van de Geer and Bühlmann 2009 and Huang, Horowitz
and Wei 2009). These methods are based on the B-spline and group variable selection
techniques and are capable of estimating and selecting component functions simultaneously,
even in high-dimensional situations. However, the approach developed by Meier, van de Geer
and Bühlmann (2009) seems to be unstable in the selection process, because it uses every
observation as a knot, which results in much fluctuation. The method proposed by Huang,
Horowitz and Wei (2009) does not provide optimal estimates for the component functions.
This is a well-known problem with spline regression, because the efficiency of the estimates
depends on the number and the position of the knots.

In this paper, we propose a lasso-type spline method (LSM) for component selection
and estimation. First, we use a penalized regression spline approximation to parametrize
the nonparametric components in the additive model, and then consider the spline
approximation as a group of variables for selection. It is worth mentioning that in our setting,
the design matrix in each group is formed from the truncated power spline basis functions.
Hence, there is a diverging number of strongly correlated variables in each group, which
makes the study more complicated and difficult. Nevertheless, the estimate of every single
function component achieves the same optimal convergence rate as that in univariate locally
adaptive nonparametric regression splines, and our final selected model is rather
parsimonious. To make the LSM stable in computation and able to adapt its estimates to the
level of smoothness of the component function, a weighted penalized regression splines method
and a projected weighted penalized regression splines method are proposed. The two-stage
estimation is obtained by using one-dimensional nonparametric techniques to refine the
estimates in the first stage, which serve as initial approximations for the additive components.
Our proposed procedure depends on only one parameter, which controls both prediction
error and misclassification error. Hence, to a certain degree, it reduces the computational
burden and attains computational stability. Simulation results illustrate that the method
is superior to competing methods in a set-up with independent predictors, and is comparable
when the predictors are correlated.

The outline of the remainder of this paper is as follows. In Section 2, we describe our
new method, study its asymptotic properties, and propose an approximation algorithm. In
Section 3, simulations and a real data application are presented to illustrate the performance
of the proposed method. A brief conclusion and discussion are given in Section 4. The
technical details of the proof are relegated to Section 5.

2. Methodology 

2.1. Penalized regression splines 

As the components in the additive model are unobservable nonparametric functions, it
is impossible to perform selection directly, and an approximation is needed. To this end,
we first examine the univariate nonparametric regression model with only one independent
variable as a basis for our method:

$$Y_i = m_0(X_i) + \varepsilon_i, \quad i = 1, \ldots, n,$$

where $X_i$ is in $[0, 1]$. Mammen and van de Geer (1997) proposed using the total
variation $TV(m^{(p-1)})$ of the $(p-1)$-th derivative of the function $m(\cdot)$ as a penalty and
minimizing the following penalized sum of squared residuals to obtain the estimate of $m_0(\cdot)$:

$$\sum_{i=1}^{n} \{Y_i - m(X_i)\}^2 + \lambda\, TV(m^{(p-1)}).$$

As with the smoothing spline, Mammen and van de Geer (1997) proved that the minimizer
of this criterion falls into the spline space, such that the estimate of $m_0(\cdot)$ itself is also a
spline function. They also showed that the estimate of $m_0(\cdot)$ has some good asymptotic
properties, such as local adaptivity and an optimal convergence rate.

To implement their idea, consider the following spline space $S(p, t)$ with knots

$$t = \{0 = t_0 < t_1 < \cdots < t_k < t_{k+1} = 1\}.$$

For $p \ge 2$, $S(p, t)$ is defined as

$$S(p, t) = \{s(x) \in C^{p-2}[0, 1] : s(x) \text{ is a polynomial of order } p \text{ (degree } p-1\text{) on each subinterval } [t_j, t_{j+1}]\}.$$

When $p = 1$, $S(p, t)$ is the set of step functions with jumps at the knots.

It is known that the space $S(p, t)$ is a $(k + p)$-dimensional linear function space, and that
the truncated power function series

$$X_x = \{1, x, x^2, \ldots, x^{p-1}, (x - t_1)_+^{p-1}, \ldots, (x - t_k)_+^{p-1}\}$$
forms its basis (see de Boor, 1978). Thus, if the number of knots $k$ is sufficiently large, then
we can approximate $m_0(x)$ by a spline function of the form

$$m(x, \beta) = X_x \beta = \beta_0 + \beta_1 x + \cdots + \beta_{p-1} x^{p-1} + \sum_{i=1}^{k} \beta_{p-1+i} (x - t_i)_+^{p-1}. \tag{2.1}$$

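Note that

$$TV(m^{(p-1)}(x, \beta)) = \sum_{i=1}^{k} |m^{(p-1)}(t_i, \beta) - m^{(p-1)}(t_{i-1}, \beta)| = (p-1)! \sum_{i=1}^{k} |\beta_{p-1+i}|.$$

By solving

$$\min_{\beta} \sum_{i=1}^{n} (y_i - m(x_i, \beta))^2 + \lambda \sum_{i=1}^{k} |\beta_{p-1+i}|, \tag{2.2}$$

we can obtain an estimate $m(x, \hat\beta)$ for the function $m_0(\cdot)$.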
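
A minimal Python sketch of the construction above may help fix ideas. It builds the truncated power basis for given knots and evaluates the total-variation expression $(p-1)!\sum_i |\beta_{p-1+i}|$. The function names `truncated_power_basis` and `tv_of_top_derivative` are ours and do not come from the paper; NumPy is assumed to be available, and the sketch is restricted to $p \ge 2$.

```python
import math
import numpy as np

def truncated_power_basis(x, knots, p):
    """Design matrix with columns 1, x, ..., x^(p-1), (x-t_1)_+^(p-1), ..., (x-t_k)_+^(p-1)."""
    if p < 2:
        raise ValueError("this sketch assumes p >= 2")
    x = np.asarray(x, dtype=float)
    knots = np.asarray(knots, dtype=float)
    poly = np.vander(x, N=p, increasing=True)               # polynomial part
    trunc = np.clip(x[:, None] - knots[None, :], 0.0, None) ** (p - 1)
    return np.hstack([poly, trunc])                         # shape (n, p + k)

def tv_of_top_derivative(beta, p):
    """Total variation of the (p-1)-th derivative: (p-1)! * sum of |truncated-power coefficients|."""
    beta = np.asarray(beta, dtype=float)
    return math.factorial(p - 1) * np.sum(np.abs(beta[p:]))

# Illustrative use: quadratic spline (p = 3) with two interior knots.
B = truncated_power_basis([0.1, 0.5, 0.9], knots=[0.25, 0.75], p=3)
tv = tv_of_top_derivative([1.0, 2.0, 3.0, 0.5, -1.5], p=3)
```

Only the coefficients of the truncated power terms enter the penalty in (2.2); the polynomial coefficients are left unpenalized there.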
2.2. Component selection for the additive model 

We now return to the additive regression model

$$Y_i = \alpha + \sum_{k=1}^{K} f_k(X_{ki}) + \varepsilon_i, \quad i = 1, \ldots, n.$$

For every function component, we assume that $E f_k(\cdot) = 0$, $k = 1, \ldots, K$, and that $f_k$ is approximated
by the spline function

$$f_k^*(x) = \beta_{k0} + \beta_{k1} x + \cdots + \beta_{k(p-1)} x^{p-1} + \sum_{j=1}^{p_k} \beta_{k(p-1+j)} (x - t_{kj})_+^{p-1}, \tag{2.3}$$

where $\{t_{kj}, j = 1, \ldots, p_k\}$ is the series of knots for the $k$th function component.

For any $k$, let $\{B_{kj}(\cdot), j = 1, \ldots, p_k\}$ be the spline bases (note that although the number
of bases should be $p_k + p - 1$, for convenience, we still denote it as $p_k$). The additive model
can then be approximated by the following linear model:

$$Y_i = \alpha^* + \sum_{k=1}^{K} \Big( \sum_{j=1}^{p_k} \beta_{kj} B_{kj}(X_{ki}) \Big) + \varepsilon_i, \quad i = 1, \ldots, n. \tag{2.4}$$

For any $k$ with $1 \le k \le K$, the basis series $B_{kj}(\cdot)$, $j = 1, \ldots, p_k$, can be regarded as a natural
group of variables in the foregoing linear model, and group variable selection can be used
to estimate $\beta_{kj}$ and to select the grouped variables. We combine the hierarchical LASSO
method (Zhou and Zhu 2007), the group bridge approach (Huang et al. 2009), and the
ideas of Mammen and van de Geer (1997), and propose the criterion

$$\min_{\alpha^*, \beta_{kj}} \sum_{i=1}^{n} \Big\{ Y_i - \alpha^* - \sum_{k=1}^{K} \sum_{j=1}^{p_k} \beta_{kj} B_{kj}(X_{ki}) \Big\}^2 + \lambda \sum_{k=1}^{K} \sqrt{|\beta_{k1}| + |\beta_{k2}| + \cdots + |\beta_{kp_k}|} \tag{2.5}$$

to select the groups. That is, we simultaneously select the significant components and
estimate the parameters $\beta_{kj}$.
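
To make the criterion concrete, the following Python sketch evaluates the residual sum of squares plus $\lambda$ times the sum over groups of the square root of the group-wise $L_1$ norm. The function name `lsm_objective` and the list-of-blocks data layout are our own illustrative choices, not the paper's.

```python
import numpy as np

def lsm_objective(y, alpha, B_groups, beta_groups, lam):
    """RSS + lam * sum_k sqrt(|beta_k1| + ... + |beta_kp_k|).

    B_groups[k] is the n x p_k basis block of component k and
    beta_groups[k] is its coefficient vector (hypothetical layout).
    """
    y = np.asarray(y, dtype=float)
    fitted = np.full(y.shape, float(alpha))
    for B, b in zip(B_groups, beta_groups):
        fitted = fitted + np.asarray(B, dtype=float) @ np.asarray(b, dtype=float)
    rss = np.sum((y - fitted) ** 2)
    penalty = sum(np.sqrt(np.sum(np.abs(b))) for b in beta_groups)
    return rss + lam * penalty

# A group whose coefficients are all zero contributes nothing to the penalty,
# which is what allows whole components to be dropped.
value = lsm_objective(
    y=[1.0, 2.0], alpha=0.0,
    B_groups=[np.eye(2), np.eye(2)],
    beta_groups=[np.array([1.0, 2.0]), np.zeros(2)],
    lam=2.0,
)
```

Because the square root is applied to each group's $L_1$ norm, the penalty acts at the group level and within groups at the same time.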

However, this linear approximation does not mean that the problem is exactly identical
to the case in the classical linear model. First, to make a good approximation of the function
$f_k(\cdot)$, $p_k$, the number of basis functions in the spline approximation, must be sufficiently
large and, theoretically, increases with the sample size $n$. Thus, even when $K$, the number
of function components, is of a moderate size, the linear structure derived has a diverging
number of predictors if we do want to regard the model as linear. Second, the grouped
variables $B_{kj}(\cdot)$, $j = 1, \ldots, p_k$, are all related to the variable $X_k$, and are thus strongly
correlated, especially when the power basis functions are used. Third, distinct from Zhou
and Zhu (2007), in our setup, the estimation accuracy of the whole function, rather than
the estimation accuracy of a particular coefficient, is of interest and importance. Thus,
as the objective here is to find a good approximation of each function component, the
asymptotic results obtained by Zhou and Zhu (2007) and Huang et al. (2009) cannot be
directly applied to the additive model.

2.3. Asymptotic theory 

To study the asymptotic behavior of the model, we first consider a more general
situation. Let $\mathcal{F}$ be a class of functions on $[0, 1]$. For a linear subspace $\mathcal{F}_n$ of $\mathcal{F}$, we consider a
penalty $V : \mathcal{F}_n \to [0, \infty)$ that satisfies

$$V(f_1 + f_2) \le V(f_1) + V(f_2), \quad f_1, f_2 \in \mathcal{F}_n,$$

and

$$V(af) \le |a| V(f), \quad f \in \mathcal{F}_n,\ a \in \mathbb{R}.$$

Consider the additive model (1.1) with $E f_k(X_{ki}) = 0$, $k = 1, \ldots, K$. For a tuning parameter $\lambda_n$,
the functions $f_k$, $k = 1, \ldots, K$, are estimated by minimizing the penalized sum of squares over $\mathcal{F}_n$:

{1 " / ^ V ^1 

Write $\mathcal{F}_n(1) = \{f \in \mathcal{F}_n : V(f) \le 1\}$. For a subset $\mathcal{A}$ of $\mathcal{F}$, we denote the $\delta$-entropy of $\mathcal{A}$
by $\log N_2(\delta, \|\cdot\|_n, \mathcal{A})$. This is the logarithm of the minimal number of balls of radius $\delta$
covering $\mathcal{A}$, where $\|\cdot\|_n$ is the $L_2$-norm with respect to the empirical probability measure
of $X_i$, $i = 1, \ldots, n$, with the form $\|g\|_n^2 = \frac{1}{n}\sum_{i=1}^{n} g^2(X_i)$. To obtain the required result, we
must first assume the following condition.

Condition 1 The errors $\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n$ are independent, with $E\varepsilon_i = 0$, and have
subgaussian tails. That is, there exist some positive $\beta$ and $\Gamma$ such that

$$E[\exp(\beta \varepsilon_i^2)] \le \Gamma < \infty, \quad \text{for } i = 1, \ldots, n.$$
Theorem 1. Assume that Condition 1 holds. Let $C_n$ be a positive number sequence such
that for the functions $f_{k,n}$, $k = 1, \ldots, K$, in $\mathcal{F}_n$ we have

$$\|f_{k,n} - f_k\|_n = O(n^{-1/(2+w)} C_n^{w/(2+w)}) \quad \text{and} \quad V(f_{k,n}) \le C_n, \quad k = 1, \ldots, K. \tag{2.6}$$

Let Xn = Cn~'^/^'^^^^Or!'^ w)/{2+w) ^ Furthermore, assume that for some C > and < 
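$w < 2$, the following entropy bound condition is satisfied.

$$\log N_2(\delta, \|\cdot\|_n, \mathcal{F}_n(1)) \le C \delta^{-w} \quad \text{for all } \delta > 0. \tag{2.7}$$

We then have

$$\|\hat f_k - f_k\|_n = O_p(n^{-1/(2+w)} C_n^{w/(2+w)}), \quad \text{and} \quad V(\hat f_k) = O_p(C_n).$$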

From (2.3) to (2.5), we can define the penalty functional as the $L_1$ norm of the coefficients
for the spline approximation $f_k^*(\cdot)$, that is,

$$V(f_k^*) = \sum_{j=1}^{p_k} |\beta_{kj}|.$$

k=l 

In fact, this gives V{f^) = \Pk\ + {p - 1)! • TV(/*^^"^^), where TV denotes the 

total variation. We obtain the following result for the entropy of the total variation space. 

Proposition 1. Define $\mathcal{F}_n(1) = \{f \in \mathcal{F}_n : V(f) = \sum_{k=1}^{K} V(f_k) \le 1\}$. There then exists a
constant $M > 0$ such that

$$\log N_2(\delta, \|\cdot\|_n, \mathcal{F}_n(1)) \le M \delta^{-1/p}. \tag{2.8}$$
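
As a quick arithmetic check (our own illustration, assuming that the sequence $C_n$ in Theorem 1 stays bounded), plugging the entropy exponent $w = 1/p$ from (2.8) into the rate $n^{-1/(2+w)}$ of Theorem 1 gives the familiar univariate spline rate:

```latex
\[
w = \frac{1}{p}
\quad\Longrightarrow\quad
n^{-1/(2+w)} = n^{-1/(2 + 1/p)} = n^{-p/(2p+1)},
\qquad \text{e.g. } p = 2:\ n^{-2/5}, \quad p = 3:\ n^{-3/7}.
\]
```

The exponent approaches the parametric value $1/2$ as the spline order $p$ grows.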

To state our results for the asymptotic behavior of the penalized least-squares estimate 
(2.5), we need some further conditions. 
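Condition 2 For any $j$,

$$\max_{1 \le i \le k_j} |h_{j,i+1} - h_{ji}| = o(k_j^{-1}), \qquad \frac{\max_{1 \le i \le k_j} h_{ji}}{\min_{1 \le i \le k_j} h_{ji}} \le M,$$

where $h_{ji} = t_{ji} - t_{j,i-1}$, $k_j$ is the number of knots, and $M > 0$ is a predetermined constant.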

Theorem 2. Assume that f^ G = 1,2,..., A', and that the total variation of its 

{p — l)-th derivative is bounded. Let A„ = Cn 2p+i^ where C is a large constant. Then, 
under Conditions 1 and 2, we have

$$\|\hat f_k - f_k\|_n = O_p(n^{-p/(2p+1)}) \tag{2.9}$$

for $\min_{1 \le k \le K} p_k > n^{1/(2p+1)}$.
2.4. Computation



2.4.1. Algorithms 

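The penalty function in (2.5) can be regarded as nonconcave. Hence, the quadratic
approximation method and the iterative algorithm proposed by Fan and Li (2001) can be used
to define estimates of the coefficients. First, consider the derivative of the penalty function for
$\beta_{kj}$. Let $\psi(\beta_k) = \sqrt{|\beta_{k1}| + \cdots + |\beta_{kp_k}|}$. Up to the constant factor $1/2$, which can be
absorbed into $\lambda$, this gives

$$\lambda \frac{\partial \psi(\beta_k)}{\partial \beta_{kj}} \propto \frac{\lambda\, \mathrm{sgn}(\beta_{kj})}{\sqrt{|\beta_{k1}| + \cdots + |\beta_{kp_k}|}} = \frac{\lambda \beta_{kj}}{|\beta_{kj}| \sqrt{|\beta_{k1}| + \cdots + |\beta_{kp_k}|}}. \tag{2.10}$$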



To simplify the notation, we rewrite (2.5) in matrix form as

$$\|Y - \alpha - X\beta\|^2 + \lambda \Psi(\beta), \tag{2.11}$$

where $X = (X_1, X_2, \ldots, X_n)^T$,

$$X_i = \{B_{11}(X_{1i}), \ldots, B_{1p_1}(X_{1i}), B_{21}(X_{2i}), \ldots, B_{2p_2}(X_{2i}), \ldots, B_{Kp_K}(X_{Ki})\}^T,$$

and $\Psi(\beta) = \sum_{k=1}^{K} \psi(\beta_k)$.

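If $\beta^*$ with nonzero coefficients $(\alpha^*, \beta_1^*, \ldots, \beta_K^*)$ minimizes (2.11), then the
following equation is satisfied:

$$\beta^* = (X_{\beta^*}^T X_{\beta^*} + \Sigma_\lambda(\beta^*))^{-1} X_{\beta^*}^T Y, \tag{2.12}$$

where $X_{\beta^*} = (X_1^*, \ldots, X_n^*)^T$,

$$X_i^* = \{1, X_{\beta_1^* i}, \ldots, X_{\beta_K^* i}\}^T,$$

and

$$\Sigma_\lambda(\beta^*) = \mathrm{diag}\{0, \psi'(\beta_1^*)/\beta_1^*, \ldots, \psi'(\beta_K^*)/\beta_K^*\}.$$

Hence, as in Fan and Li (2001), given an initial value $\beta_0$, (2.12) suggests an iterative
algorithm that updates the estimate to $\beta_1$ according to the following equation:

$$\beta_1 = (X_{\beta_0}^T X_{\beta_0} + \Sigma_\lambda(\beta_0))^{-1} X_{\beta_0}^T Y.$$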

Fan and Li (2001) suggested that this iterative step is similar to the one-step MLE if the 
initial value is sufficiently good. If a reasonable initial value of /3 is selected, then our 
algorithm should converge within a few steps. 
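
The ridge-type update can be sketched in Python as follows. This is a minimal illustration under our own assumptions, not the authors' implementation: the intercept is omitted, the quadratic weight $\lambda / (4|\beta_{kj}|\sqrt{\sum_j |\beta_{kj}|})$ comes from a local quadratic majorization of the square-root penalty at the current iterate, the constant `eps` guards against division by zero, and the name `lqa_group_sqrt_l1` is hypothetical.

```python
import numpy as np

def lqa_group_sqrt_l1(X, y, groups, lam, n_iter=100, eps=1e-8, tol=1e-6):
    """Iterative ridge-type updates for ||y - X b||^2 + lam * sum_k sqrt(sum_j |b_kj|).

    groups[j] is the group label of column j of X.
    """
    groups = np.asarray(groups)
    beta = np.linalg.lstsq(X, y, rcond=None)[0]           # least-squares starting value
    XtX, Xty = X.T @ X, X.T @ y
    for _ in range(n_iter):
        abs_b = np.maximum(np.abs(beta), eps)              # avoid division by zero
        # For each column, the L1 norm of the group that the column belongs to.
        group_sum = np.array([abs_b[groups == g].sum() for g in groups])
        # Quadratic majorizer weight for the square-root group penalty.
        w = lam / (4.0 * abs_b * np.sqrt(group_sum))
        beta = np.linalg.solve(XtX + np.diag(w), Xty)      # ridge-type update
    beta[np.abs(beta) < tol] = 0.0                         # report negligible values as zero
    return beta

# Illustrative data: the first group carries signal, the second is pure noise.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 6))
beta_true = np.array([2.0, -1.5, 1.0, 0.0, 0.0, 0.0])
y = X @ beta_true + 0.5 * rng.standard_normal(200)
beta_hat = lqa_group_sqrt_l1(X, y, groups=[0, 0, 0, 1, 1, 1], lam=100.0)
```

In this toy run the coefficients of the noise group are driven to zero while the signal group is only mildly shrunk, which is the group-selection behaviour the criterion is designed to produce.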






2.4.2. Tuning parameter selection 



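The tuning parameter $\lambda$ is very important for estimating $\beta$. Fan and Li (2001) proposed
using generalized cross-validation to select $\lambda$. Let $\hat\beta(\lambda)$ be the estimate of $\beta$ with the tuning
parameter $\lambda$. The generalized cross-validation statistic is defined as

$$\mathrm{GCV}(\lambda) = \frac{1}{n} \frac{\|Y - X_{\hat\beta(\lambda)} \hat\beta(\lambda)\|^2}{\{1 - e(\lambda)/n\}^2}, \tag{2.13}$$

and

$$\hat\lambda = \arg\min_{\lambda} \{\mathrm{GCV}(\lambda)\},$$

where $e(\lambda) = \mathrm{trace}[P_X(\hat\beta(\lambda))]$ and $P_X(\beta) = X_\beta (X_\beta^T X_\beta + \Sigma_\lambda(\beta))^{-1} X_\beta^T$.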

According to Wang, Li, and Tsai (2007), log(GCV) is very similar to the traditional
model selection criterion AIC. Although AIC is an efficient selection criterion that selects
the best finite-dimensional candidate model in terms of prediction accuracy, it is not a
consistent selection criterion because it does not select a correct model with a probability
approaching 1 as the sample size goes to infinity. Moreover, for our proposed method the
number of knots, or the dimension of $\beta$, is very large and increases with the sample size
$n$, and thus an adjustment for such a criterion is necessary. To account for the effect of
dimensionality and correctly select the significant variables, we suggest applying an inflation
factor to the GCV. A modified generalized cross-validation (MGCV) is defined as

$$\mathrm{MGCV}(\lambda) = \frac{1}{n} \frac{\|Y - X_{\hat\beta(\lambda)} \hat\beta(\lambda)\|^2}{\{1 - \gamma e(\lambda)/n\}^2}, \tag{2.14}$$

where $\gamma$ is the inflation factor. When $\gamma = 1$, the MGCV is no different from the GCV
proposed by Fan and Li (2001). Based on our experience and the discussions of Luo and
Wahba (1997) and Friedman and Silverman (1989), we suggest selecting $\gamma$ within the interval
(1.2, 3) as an extra penalty.
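
A small Python sketch of how such a criterion could be evaluated is given below. It assumes the Fan and Li (2001) form of the GCV numerator (residual sum of squares divided by $n$) and takes the ridge-type matrix $\Sigma_\lambda$ as an input; the function name `mgcv` and its signature are hypothetical.

```python
import numpy as np

def mgcv(y, X, beta_hat, Sigma, gamma=1.5):
    """Return (MGCV value, effective number of parameters e).

    MGCV = (RSS / n) / (1 - gamma * e / n)^2, with
    e = trace[X (X'X + Sigma)^{-1} X'].  gamma = 1 gives the ordinary GCV.
    """
    n = len(y)
    smoother = X @ np.linalg.solve(X.T @ X + Sigma, X.T)
    e = np.trace(smoother)
    rss = np.sum((y - X @ beta_hat) ** 2)
    return (rss / n) / (1.0 - gamma * e / n) ** 2, e

# Illustrative use with an unpenalized fit, for which e equals the number of columns.
rng = np.random.default_rng(1)
X = rng.standard_normal((50, 3))
y = rng.standard_normal(50)
b = np.linalg.lstsq(X, y, rcond=None)[0]
gcv_value, e_gcv = mgcv(y, X, b, np.zeros((3, 3)), gamma=1.0)
mgcv_value, e_mgcv = mgcv(y, X, b, np.zeros((3, 3)), gamma=1.5)
```

A larger $\gamma$ charges more for each effective parameter, so the minimizer over a grid of $\lambda$ values tends to favour sparser fits.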

2.5. Further Considerations 



2.5.1. Weighted penalized regression splines method 



Our method is based on power spline regression. It is well known that power spline
regression is not stable in computation because of the strong correlation between power bases,
and because many basis functions are related to only a few observations. To make our numerical
results more stable, we weight the power spline basis for every component function as

$$X^{*w} = X^* \times W,$$

where $W = \mathrm{diag}\{(X^{*T} X^*/n)^{-1/2}\}$ and $X^*$ is given in (2.12). (2.11) can then be rewritten
as

$$\|Y - X^{*w}\beta^{*w}\|^2 + \lambda \Psi(\beta^{*w}).$$

By some elementary calculations, it is easy to determine that when all of the components
of $X^{*w}$ are independent, the variance of the least-squares estimate of $\beta^{*w}$ should be of
the order $1/n$. Also, when the sample points $X_{ki}$, $i = 1, \ldots, n$, are equally spaced, our
method is equivalent to transferring the power basis spline approximation to a B-spline
approximation with an $L_1$-norm penalty on a linear combination of the coefficients in the
B-spline approximation. Furthermore, as the variance of the least-squares estimate of $\beta^{*w}$
is of the order $1/n$, as in the wavelet approximation (Donoho and Johnstone 1994), the
universal threshold $\sqrt{\log n / n}$ can be used to penalize each coefficient. In other words, the
tuning parameter can be searched within a small interval with length $O(\sqrt{\log n / n})$.
These modifications result in a stable final penalized component function estimate.
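
The weighting step can be illustrated with a short Python sketch. We interpret $W$ as the diagonal matrix of inverse square roots of the diagonal entries of $X^{*T}X^*/n$, so that every weighted column has unit empirical second moment; this reading of the exponent is our assumption, and the function name `weight_columns` is hypothetical.

```python
import numpy as np

def weight_columns(X_star):
    """Rescale columns so that diag(Xw' Xw / n) = 1; returns (Xw, W)."""
    X_star = np.asarray(X_star, dtype=float)
    n = X_star.shape[0]
    scale = np.sqrt(np.sum(X_star ** 2, axis=0) / n)   # sqrt of diag(X*' X* / n)
    W = np.diag(1.0 / scale)
    return X_star @ W, W

# Illustrative use: columns on very different scales become comparable.
rng = np.random.default_rng(2)
X_raw = rng.uniform(size=(30, 4)) * np.array([1.0, 10.0, 100.0, 0.01])
Xw, W = weight_columns(X_raw)
```

Coefficients estimated on the weighted scale map back to the original scale through `W @ beta_w`, since `X_raw @ (W @ beta_w)` equals `Xw @ beta_w`.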

2.5.2. Projected weighted penalized regression splines method 

To make our final estimated model parsimonious and easy to interpret, in addition to
selecting significant component functions, we also suggest the following procedure for the
component estimation. Note that the power spline approximation (see the definition of
$X_x$ above (2.1)) expands the component function as the sum of a polynomial and a linear
combination of truncated power basis functions. Partition $X^{*w}$ as $(1, B_1^w, \ldots, B_k^w, \ldots, B_K^w)$,
with $B_k^w$ being the block from the $(2 + \sum_{l=1}^{k-1} p_l)$-th column to the $(1 + \sum_{l=1}^{k} p_l)$-th column in
matrix $X^{*w}$. Here, $1$ is an $n \times 1$ column vector in which all the elements are equal to 1.

We then write $B_k^w = (B_{k1}^w, B_{k2}^w)$, where $B_{k1}^w$ corresponds to the polynomial part
and $B_{k2}^w$ to the truncated power basis functions. We regard these as
two groups and then penalize each group separately, which makes it possible to estimate
the component functions adaptively when they are actually polynomial functions, without any
great effect from the truncated power basis functions. The estimation is thus adaptive to the
level of smoothness of the component functions. This approach
may result in a more parsimonious estimation than that obtained by the previous estimation
algorithm. However, we note that the two groups are strongly correlated. To realize the
approach and to make the algorithm more efficient, we consider the following empirical
power basis functions:

$$B_k^{pw} = \{B_{k1}^w, (I - P_{B_{k1}^w}) B_{k2}^w\},$$

where $P_{B_{k1}^w}$ is the projection matrix from $B_{k2}^w$ to $B_{k1}^w$, that is, onto the column space of
$B_{k1}^w$. This projection method is able to
reduce the correlation between the two groups in the spline approximation. Let

$$X^{*pw} = (1, B_1^{pw}, \ldots, B_k^{pw}, \ldots, B_K^{pw}).$$

(2.11) can also be written as

$$\|Y - X^{*pw}\beta^{*pw}\|^2 + \lambda \Psi(\beta_1^{*pw}) + \lambda \Psi(\beta_2^{*pw}),$$

where $\Psi(\beta_1^{*pw})$ are the group penalty functions for the polynomial coefficients for all of
the component additive functions and $\Psi(\beta_2^{*pw})$ are the group penalty functions for the
coefficients of the truncated power bases.
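
The projection step can be sketched in Python as below. The sketch residualizes the truncated power block on the polynomial block by least squares, which is one standard way to compute $(I - P_{B_{k1}})B_{k2}$; the function name `project_out` and the toy basis are our own.

```python
import numpy as np

def project_out(B1, B2):
    """Return (I - P_B1) B2: the part of B2 orthogonal to the column space of B1."""
    coef, *_ = np.linalg.lstsq(B1, B2, rcond=None)   # regress each column of B2 on B1
    return B2 - B1 @ coef

# Illustrative use with a quadratic spline on [0, 1] and three knots.
x = np.linspace(0.0, 1.0, 100)
B1 = np.column_stack([x, x ** 2])                                     # polynomial block
B2 = np.clip(x[:, None] - np.array([0.25, 0.5, 0.75]), 0.0, None) ** 2  # truncated powers
B2_proj = project_out(B1, B2)
```

After the projection the two blocks are exactly orthogonal in the sample, so penalizing them as separate groups no longer trades one block off against the other through their correlation.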



2.5.3. Two-stage estimation 

When the dimension of the additive regression model is very high, selecting significant 
component functions becomes very difficult. The model selection and estimation may not 
be consistent, and the estimation procedure may also become unstable. To improve the 
estimation accuracy, we suggest a two-stage estimation approach. In the first stage, we 
use our proposed methods to select and estimate significant component functions as initial 
approximations of all of the selected components. Let 

$$\mathcal{M} = \{k : f_k(X_k) \text{ selected to be significant based on our methods}\}$$

and denote the corresponding estimates by $\hat f_k$, $k \in \mathcal{M}$. In the second stage, we obtain
refined estimates as follows. For the $f_s(\cdot)$ selected in the first stage, define

$$Y_i^* = Y_i - \sum_{k \in \mathcal{M},\, k \ne s} \hat f_k(X_{ki}), \quad i = 1, 2, \ldots, n,$$

and then estimate $f_s(X_{si})$ nonparametrically using the following model:

$$Y_i^* = f_s(X_{si}) + \varepsilon_i^*, \quad i = 1, \ldots, n.$$

For this new model, we can again use the method applied in the first-stage estimation to
obtain the final estimator of $f_s(X_{si})$.

3. Numerical studies 

3.1. Simulations 

We conduct simulations to examine the effectiveness of the proposed lasso-type spline
method for component function selection and estimation in the additive regression model.
The algorithm proposed in Section 2.4.1 is called the original lasso-type spline method
(OLSM), that in Section 2.5.1 the weighted lasso-type spline method (WLSM), and that
in Section 2.5.2 the projected weighted lasso-type spline method (PWLSM). We also
compare the results with those obtained using the sparsity-smoothness penalty (SSP) approach
recently proposed by Meier, van de Geer, and Bühlmann (2009) by using the R packages
provided by the authors. For selection performance, we compute the true positive ratio
(TPR) and false positive ratio (FPR); and for estimation accuracy, we compute the
empirical prediction mean squared error (MSE). Letting $\hat f_k$ be the estimator of $f_k$, the MSE is defined
as

$$\frac{1}{n} \sum_{i=1}^{n} \{\hat f_k(X_{ki}) - f_k(X_{ki})\}^2,$$

where $\{Y_i, X_{1i}, \ldots, X_{Ki}\}$ are the data points.
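
These performance measures are simple to compute. The Python sketch below assumes the usual definitions, which the text does not spell out: TPR is the fraction of truly nonzero components that are selected, and FPR is the fraction of truly zero components that are selected. The function names are hypothetical.

```python
import numpy as np

def selection_rates(selected, truly_active, K):
    """Return (TPR, FPR) for a set of selected component indices out of K components."""
    selected, truly_active = set(selected), set(truly_active)
    true_pos = len(selected & truly_active)
    false_pos = len(selected - truly_active)
    tpr = true_pos / len(truly_active)
    fpr = false_pos / (K - len(truly_active))
    return tpr, fpr

def component_mse(f_hat_values, f_true_values):
    """Empirical MSE: mean of squared differences at the observed design points."""
    diff = np.asarray(f_hat_values, dtype=float) - np.asarray(f_true_values, dtype=float)
    return float(np.mean(diff ** 2))

# Illustrative use: 4 truly active components out of K = 50.
tpr, fpr = selection_rates(selected={1, 2, 3, 7}, truly_active={1, 2, 3, 4}, K=50)
mse = component_mse([1.0, 2.0], [1.0, 4.0])
```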

In the simulations, the sample size is $n = 400$ and a total of 100 simulation runs are used.
To reduce the computational burden, the knots are designed as follows. Let the number of
knots be $k$. For each predictor $X_j$, the knots are selected to be the $[nl/k]$-th order statistics
$\{X_{j([nl/k])}, l = 1, \ldots, k\}$ of $\{X_{j1}, \ldots, X_{jn}\}$. Quadratic splines are used, which gives a total
number of basis functions with $K$ function components of $(2 + k)K + 1$. To check the
sensitivity of the methods to the knot number selection, we tried the values 10, 15, 40, and
60 with a fixed $\lambda$, and found that the numerical results did not differ much. We thus posit
that our proposed three procedures are insensitive to the initial knot number as long as it
is sufficiently large. However, with a larger number of knots, the computation time is greater
and the performance is a little worse, as the computation may be less stable due to strongly
correlated variables in the splines. We thus set $k$ at 15 in the simulations. The penalty
parameter $\lambda$ is found to be critical. We choose $\lambda$ by computing the MGCV criterion defined
in (2.14) for a grid of $\lambda$ values and choosing the minimizer over the grid. The inflation
factor $\gamma$ is taken to be 1.5. The grid of $\lambda$ for all three proposed procedures has 100 values
and satisfies the condition that the values of $\log_{10}(\lambda)$ are equally spaced between $-5$ and 2.
The sparsity-smoothness penalty approach (SSP) requires the selection of two parameters
$\lambda_1$ and $\lambda_2$, where the former serves to control the sparsity and the latter the smoothness.
Both parameters are chosen by using 100 grid points for $\lambda_1$ and 15 grid points for $\lambda_2$ in the
spirit of Meier, van de Geer, and Bühlmann (2009). The simulation experiments are similar
to those in Example 1 and Example 3 of Meier, van de Geer, and Bühlmann (2009). As
our focus is on simultaneous selection and estimation, $K$ is chosen to be 50 rather than an
ultra-high dimension.
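
The knot placement and the tuning grid described above can be sketched in Python as follows. We read $[\cdot]$ as the integer part, which is an assumption, and the function name `order_statistic_knots` is ours.

```python
import numpy as np

def order_statistic_knots(x, k):
    """Knots at the [n*l/k]-th order statistics of x, for l = 1, ..., k."""
    x_sorted = np.sort(np.asarray(x, dtype=float))
    n = len(x_sorted)
    idx = np.floor(n * np.arange(1, k + 1) / k).astype(int)   # 1-based order-statistic indices
    return x_sorted[idx - 1]

# 100 values of lambda with log10(lambda) equally spaced between -5 and 2.
lambda_grid = np.logspace(-5, 2, num=100)

# Illustrative use with n = 400 observations and k = 15 knots, as in the simulations.
x_demo = np.random.default_rng(3).permutation(np.arange(1, 401))
knots_demo = order_statistic_knots(x_demo, 15)
```

With quadratic splines and $k = 15$, each component contributes $2 + 15 = 17$ basis functions, so $K = 50$ components give $17 \times 50 + 1 = 851$ columns including the intercept.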

Example 1. (Covariates are independent). The data are generated from

$$Y = f_1(X_1) + f_2(X_2) + f_3(X_3) + f_4(X_4) + \sum_{k=5}^{50} f_k(X_k) + \varepsilon,$$

where

$$f_1(x) = -\sin(2x), \quad f_2(x) = x^2 - 25/12, \quad f_3(x) = x,$$
$$f_4(x) = \exp(-x) - 2\sinh(5/2)/5, \quad f_k(x) = 0 \ \text{ if } k \ge 5,$$

and $\varepsilon \sim N(0, 1)$. The predictors are sampled from the uniform distribution on $(-2.5, 2.5)$.
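
A data-generating sketch for Example 1 in Python is given below. It reads $f_2$ as $x^2 - 25/12$, which is the form that has mean zero under the uniform distribution on $(-2.5, 2.5)$; the function name `simulate_example1` and the seed handling are our own choices.

```python
import numpy as np

def f1(x):
    return -np.sin(2.0 * x)

def f2(x):
    return x ** 2 - 25.0 / 12.0

def f3(x):
    return x

def f4(x):
    return np.exp(-x) - 2.0 * np.sinh(2.5) / 5.0

def simulate_example1(n=400, K=50, seed=0):
    """Draw (X, y): four active components, K - 4 null components, N(0, 1) noise."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(-2.5, 2.5, size=(n, K))
    signal = f1(X[:, 0]) + f2(X[:, 1]) + f3(X[:, 2]) + f4(X[:, 3])
    y = signal + rng.standard_normal(n)
    return X, y

X_sim, y_sim = simulate_example1()
```

The constants $25/12$ and $2\sinh(5/2)/5$ are exactly the means of $x^2$ and $\exp(-x)$ under this uniform design, so each active component satisfies the centering condition $E f_k(X_k) = 0$.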
Example 2. (Covariates are correlated). The model is

$$Y = f_1(X_1) + f_2(X_2) + f_3(X_3) + f_4(X_4) + \sum_{k=5}^{50} f_k(X_k) + \varepsilon,$$

with

fi{x) = 5x, /2(x) = 3(2x-l)2, ;3(^) = ^Igg^^ 

$$f_4(x) = 0.6\sin(2\pi x) + 1.2\cos(2\pi x) + 1.8\sin^2(2\pi x) + 2.4\cos^2(2\pi x) + 3\sin^3(2\pi x),$$
$$f_k(x) = 0 \ \text{ if } k \ge 5,$$

and $\varepsilon \sim N(0, 1.74)$. The covariates $X = (X_1, \ldots, X_K)$ are generated from

Xh = — ^— ^- , k = 1, . . . , K, 

1 + 0.5 ' ' ' ' 

where $W_1, \ldots, W_K$ and $U$ are i.i.d. Uniform(0, 1). This provides a design with a correlation
coefficient of 0.5 between all of the covariates.

The simulation results are reported in Tables 1 and 2. The median of the MSE and the
robust standard deviation of the MSE (the ratio of the interquartile range to the standard normal
interquartile range, $\Phi^{-1}(0.75) - \Phi^{-1}(0.25)$) are reported. "MSE $f_k$" means the MSE of the
estimate of $f_k(\cdot)$, and "MSE" means the MSE for the full model. The row "SSP" shows
the results of the SSP method developed by Meier, van de Geer, and Bühlmann (2009). The
rows "OLSM", "WLSM", and "PWLSM" summarize the results for the original lasso-type spline
method, the weighted lasso-type spline method, and the projected weighted lasso-type spline
method, respectively. For each of these methods, the "oracle" (assuming that all of the functions
are known except the one to be estimated), "one-stage", and "two-stage" estimates are reported. The
TPR and FPR results for each method are reported in Table 3. The curve estimates for
the component functions are summarized in Figures 3 to 8.
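
The robust standard deviation described above can be computed as in the following sketch. The function name `robust_sd` is an illustrative choice, and the default linear-interpolation quantile rule of `numpy.percentile` is an assumption, since the quantile convention is not specified in the text.

```python
from statistics import NormalDist

import numpy as np

def robust_sd(values):
    """Interquartile range divided by the standard normal interquartile range."""
    q25, q75 = np.percentile(values, [25, 75])
    normal_iqr = NormalDist().inv_cdf(0.75) - NormalDist().inv_cdf(0.25)
    return (q75 - q25) / normal_iqr
```

The standard normal interquartile range is approximately 1.349, so a sample interquartile range of 1 gives a robust standard deviation of about 0.741 and a range of 2 gives about 1.483. These are the values that appear for the integer-valued counts in Table 3.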

The results tabulated in Tables 1 and 3 and plotted in Figures 5 and 7 for the independent
covariates in Example 1 show the differences among the "oracle" cases of the OLSM,
WLSM, and PWLSM to be nearly negligible. The MSE of the two-stage estimates is
significantly smaller than that of the corresponding one-stage estimates and approximates
that in the "oracle" case. Of all of these methods, the two-stage estimates of the PWLSM
are superior to the others. The numbers of true positives and false positives for the PWLSM

Table 1: Mean squared error (MSE) for Example 1 (the numbers in parentheses are the robust standard deviation estimates).

| Method | Estimate | MSE $f_1$ | MSE $f_2$ | MSE $f_3$ | MSE $f_4$ | MSE |
|---|---|---|---|---|---|---|
| SSP | | 0.122(0.189) | 0.105(0.210) | 0.105(0.189) | 0.124(0.188) | 1.358(0.277) |
| OLSM | Oracle | 0.024(0.110) | 0.013(0.108) | 0.008(0.080) | 0.028(0.126) | 0.082(0.217) |
| OLSM | One-stage | O.OoO(0.142j | U.il9(U.Zl / j | U.iyU(U.Zl 1 ) | 0. 511(0. iooj | 1.00 / (0.295) |
| OLSM | Two-stage | 0.526(0.145) | 0.022(0.143) | 0.035(0.188) | 0.366(0.165) | 0.949(0.267) |
| WLSM | Oracle | 0.020(0.105) | 0.013(0.111) | 0.008(0.083) | 0.027(0.128) | 0.081(0.219) |
| WLSM | One-stage | 0.479(0.292) | 0.025(0.133) | 0.052(0.204) | 0.071(0.175) | 0.615(0.338) |
| WLSM | Two-stage | 0.040(0.592) | 0.014(0.118) | 0.013(0.093) | 0.038(0.142) | 0.141(0.575) |
| PWLSM | Oracle | 0.011(0.101) | 0.015(0.112) | 0.006(0.084) | 0.024(0.130) | 0.078(0.210) |
| PWLSM | One-stage | 0.115(0.141) | 0.016(0.115) | 0.008(0.094) | 0.042(0.158) | 0.194(0.265) |
| PWLSM | Two-stage | 0.022(0.111) | 0.016(0.125) | 0.011(0.108) | 0.027(0.146) | 0.090(0.251) |

are the same as those of the SSP approach. However, the OLSM has a smaller number of
true positives, and the WLSM has a larger variation in true positives than the SSP, although
the median number of true positives for the WLSM is the same as that for the SSP. Figures 3, 5, and 7 show
that the OLSM and WLSM may fail to select the first component function. This is because
the first function is small in magnitude and is easily removed from the model by the
penalty. In contrast, the PWLSM always selects all of the non-zero component functions
into the model, and the knots used in the spline basis are very sparse. Furthermore, the
PWLSM selects the linear function (for example, the third panel on the top) as linear,
whereas the SSP fails to do so. The PWLSM thus outperforms the other methods in this
setting.

For the correlated covariate case in Example 2, the results presented in Tables 2 and
3 and Figures 4, 6, and 8 suggest that all of the "oracle" estimates perform similarly.
Our three proposed LSMs all apparently improve on the SSP in terms of the
MSE. The numbers of true positives and false positives for the WLSM are the same as those
for the SSP. However, the PWLSM does not perform better in every respect, as it keeps
selecting all of the true components at the expense of including a component that is slightly
more noisy than the other insignificant components.
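
For reference, the true-positive and false-positive counts discussed above can be obtained from a selected set of component indices as in the following sketch. The function name and the example selection are hypothetical; in both simulated examples the non-zero components are the first four.

```python
def count_true_false_positives(selected, true_active):
    """Return (TP, FP) for a set of selected component indices."""
    selected, true_active = set(selected), set(true_active)
    tp = len(selected & true_active)   # truly non-zero components that were selected
    fp = len(selected - true_active)   # zero components that were selected
    return tp, fp

# Hypothetical selection that misses component 1 and wrongly includes component 17.
tp, fp = count_true_false_positives({2, 3, 4, 17}, {1, 2, 3, 4})
```
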

3.2. Application 

In this section, we give more details of the analysis of the dataset described in Section [H 
As described, the dataset has been analyzed by Fan and Peng (2004) with the linear model 
and the additive model, respectively. We now use the method proposed in this paper to 

Figure 3: Estimation curves from the original Lasso-type spline method (OLSM) for Example 1. True
functions $f_j$ and estimation curves for the first four components of the simulation run that achieved the
median of the MSE are presented. The pictures in the second row are histograms of the knots used in the
one-stage OLSM estimation for the first four components, respectively.







Figure 4: Estimation curves from the original Lasso-type spline method (OLSM) for Example 2. True
functions $f_j$ and estimation curves for the first four components of the simulation run that achieved the
median of the MSE are presented. The pictures in the second row are histograms of the knots used in the
one-stage OLSM estimation for the first four components, respectively.

Figure 5: Estimation curves from the weighted Lasso-type spline method (WLSM) for Example 1. True
functions $f_j$ and estimation curves for the first four components of the simulation run that achieved the
median of the MSE are presented. The pictures in the second row are histograms of the knots used in the
one-stage WLSM estimation for the first four components, respectively.



[Figure panels omitted. Legend: true curve, two-stage WLSME, one-stage WLSME, SSPE.]



Figure 6: Estimation curves from the weighted Lasso-type spline method (WLSM) for Example 2. True
functions $f_j$ and estimation curves for the first four components of the simulation run that achieved the
median of the MSE are presented. The pictures in the second row are histograms of the knots used in the
one-stage WLSM estimation for the first four components, respectively.

Figure 7: Estimation curves from the projected weighted Lasso-type spline method (PWLSM) for Example
1. True functions $f_j$ and estimation curves $\hat f_j$ for the first four components of the simulation run that
achieved the median of the MSE are presented. The pictures in the second row are histograms of the knots
used in the one-stage PWLSM estimation for the first four components, respectively.







Figure 8: Estimation curves from the projected weighted Lasso-type spline method (PWLSM) for Example
2. True functions $f_j$ and estimation curves $\hat f_j$ for the first four components of the simulation run that
achieved the median of the MSE are presented. The pictures in the second row are histograms of the knots
used in the one-stage PWLSM estimation for the first four components, respectively.

Table 2: Mean squared error (MSE) for Example 2 (the numbers in parentheses are the robust
standard deviation estimates).

| Method | Estimate | MSE $f_1$ | MSE $f_2$ | MSE $f_3$ | MSE $f_4$ | MSE |
|---|---|---|---|---|---|---|
| SSP | | 0.225(0.373) | 0.430(0.624) | 0.755(0.328) | 3.020(0.580) | 7.224(0.764) |
| OLSM | Oracle | 0.025(0.116) | 0.282(0.198) | 0.047(0.148) | 1.566(0.635) | 3.127(0.795) |
| OLSM | One-stage | 0.133(0.2/ i) | 0.594(0.139) | 0.394(0.305) | 2.315(0.611) | 4.4/5(0. /o9j |
| OLSM | Two-stage | 0.045(0.218) | 0.589(0.181) | 0.163(0.256) | 1.696(0.621) | 3.585(0.787) |
| WLSM | Oracle | 0.018(0.110) | 0.281(0.187) | 0.050(0.154) | 1.558(0.623) | 3.112(0.780) |
| WLSM | One-stage | 0.072(0.140) | 0.470(0.412) | 0.309(0.284) | 1.880(0.623) | 3.842(0.831) |
| WLSM | Two-stage | 0.027(0.149) | 0.312(0.464) | 0.046(0.161) | 1.574(0.631) | 3.216(0.818) |
| PWLSM | Oracle | 0.025(0.112) | 0.295(0.210) | 0.055(0.156) | 1.573(0.630) | 3.178(0.778) |
| PWLSM | One-stage | 0.066(0.124) | 0.403(0.364) | 0.195(0.212) | 1.759(0.606) | 3.590(0.784) |
| PWLSM | Two-stage | 0.022(0.134) | 0.295(0.236) | 0.052(0.159) | 1.555(0.628) | 3.176(0.760) |


analyze the dataset and to make a comparison with the results of Fan and Peng (2004).

Similar to the approach of Fan and Peng (2004), we also remove the outliers and use
the 199 remaining observations for our analysis. Consider the additive model

$$\text{Salary} = \beta_0 + \beta_1 \text{Female} + \beta_2 \text{PCJob} + \sum_{i=1}^{4} \beta_{2+i} \text{Edu}_i + \sum_{i=1}^{5} \beta_{6+i} \text{JobGrd}_i + f_1(\text{YrsExp}) + f_2(\text{Age}) + \varepsilon. \quad (3.1)$$

We use LASSO for the linear part and our method for the component function selection.
The 2/17, 3/17, ..., 16/17 sample quantiles of the variables "YrsExp" and "Age" are selected
as knots, which gives 15 initial knots to estimate each component function. Fan and Peng
(2004) used only 5 knots to estimate each component function, whereas our method gives 20
more parameters to model the data. Despite this, the computational complexity is not increased
because the quadratic approximation algorithm can be easily implemented, and most of the
knots are removed iteratively by our component selection procedure. In line
with Fan and Peng (2004) and the foregoing discussion, we first weight the "design matrix"
such that, for the original least-squares estimate, the standard deviation of every coefficient
estimate for the prediction variables and the truncated power basis functions is close to the
order of $1/\sqrt{n}$. Two tuning parameters are then used to select the variables in the linear
part and the component functions. MGCV with an inflation parameter of 1.5 is used to
select the tuning parameters. The results are reported in the fourth column (WLSM, see
Section 3.1) of Table 4 and in Figure 9.
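
The knot placement described above can be illustrated with the following sketch. The function names are hypothetical. The spline degree is not stated in this passage, so `degree=2` is an assumption, chosen because 15 knots plus two polynomial terms give 17 basis functions per component, and the unweighted truncated power basis is used rather than the weighted or projected bases proposed in the paper.

```python
import numpy as np

def quantile_knots(x, n_knots=15):
    """Knots at the 2/(n_knots+2), ..., (n_knots+1)/(n_knots+2) sample quantiles."""
    probs = np.arange(2, n_knots + 2) / (n_knots + 2)  # 2/17, ..., 16/17 by default
    return np.quantile(x, probs)

def truncated_power_basis(x, knots, degree=2):
    """Columns x, ..., x^degree followed by (x - knot)_+^degree for each knot."""
    x = np.asarray(x, dtype=float)
    poly = np.column_stack([x ** d for d in range(1, degree + 1)])
    trunc = np.clip(x[:, None] - knots[None, :], 0.0, None) ** degree
    return np.hstack([poly, trunc])
```

Applied to a covariate such as "YrsExp", `truncated_power_basis(x, quantile_knots(x))` returns a matrix with 17 columns, one group of coefficients per component function.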

Table 4 shows that our method does not select the component function of "Age". This

Table 3: Median of the number of true positives (TP)
and false positives (FP) (the numbers in parentheses
are the robust standard deviation estimates).

| Example | Method | TP | FP |
|---|---|---|---|
| Example 1 | SSP | 4.000(0.000) | 0.000(0.000) |
| Example 1 | OLSM | 3.000(0.000) | 0.000(0.000) |
| Example 1 | WLSM | 4.000(0.741) | 0.000(0.000) |
| Example 1 | PWLSM | 4.000(0.000) | 0.000(0.000) |
| Example 2 | SSP | 4.000(0.741) | 0.000(0.741) |
| Example 2 | OLSM | 3.000(0.000) | 1.000(1.483) |
| Example 2 | WLSM | 4.000(0.741) | 0.000(0.741) |
| Example 2 | PWLSM | 4.000(0.000) | 1.000(0.741) |

is consistent with the result of SCAD-PLS for the linear model, which is reported in the
second column of Table 4. The other estimates of the coefficients are similar to those
obtained with the first three methods. The function of "YrsExp" is now estimated as an
increasing function (see Figure 9). Only two of the 17 spline basis functions are selected to
estimate the component function of "YrsExp". Hence, the selected model is much simpler
than the model selected by SCAD-PLS under the additive structure. The
value in the last row of the fourth column for the WLSM is larger than that in the second or third column
for SCAD-PLS. This means that, compared with the SCAD-PLS method for either the
linear model or the additive model, our method provides a more reasonable estimation and
selects a simpler model in this real data example.

Table 4: Estimates and standard errors for the Fifth National Bank data.

| Variable | Least-Squares | SCAD-PLS (linear model) | SCAD-PLS (additive model) | WLSM |
|---|---|---|---|---|
| Intercept | 54.238(2.067) | 55.835(1.527) | 52.470(2.890) | 55.820(1.437) |
| Female | -0.556(0.637) | -0.624(0.639) | -0.933(0.708) | -0.693(0.656) |
| PcJob | 3.982(0.908) | 4.151(0.909) | 2.851(0.640) | 3.935(0.908) |
| Ed1 | -1.739(1.049) | 0(-) | 0(-) | 0(-) |
| Ed2 | -2.866(0.999) | -1.074(0.522) | -0.542(0.265) | -1.385(0.764) |
| Ed3 | -2.145(0.753) | -0.914(0.421) | 0(-) | -1.180(0.601) |
| Ed4 | -1.484(1.369) | 0(-) | 0(-) | 0(-) |
| Job1 | -22.954(1.734) | -24.643(1.535) | -22.841(1.332) | -23.325(1.561) |
| Job2 | -21.388(1.686) | -22.818(1.546) | -20.591(1.370) | -21.494(1.580) |
| Job3 | -17.642(1.634) | -18.803(1.562) | -16.719(1.391) | -17.440(1.602) |
| Job4 | -13.046(1.578) | -13.859(1.529) | -11.807(1.359) | -12.536(1.542) |
| Job5 | -7.462(1.551) | -7.770(1.539) | -5.235(1.150) | -6.477(1.537) |
| YrsExp | 0.215(0.065) | 0.193(0.046) | -(-) | -(-) |
| Age | 0.030(0.039) | 0(-) | -(-) | 0(-) |
| | 0.8221 | 0.8176 | 0.8123 | 0.8182 |



4. Conclusion 

In this paper, we propose a LASSO-type method for selecting nonparametric components
in the additive regression model. This method can be used to select and
estimate components simultaneously. Simulations show that, for a high-dimensional additive model, the
proposed methods can shrink the function components that correspond to the nonsignificant
predictors exactly to zero and produce a parsimonious model. For an ultra-high dimensional
additive model, we follow the idea of Fan et al. (2009) and use the SIS method to first reduce
the ultra-high dimension of the additive model to a high dimension, and then use our proposed
method to select and estimate the significant components. Intuitively, it is possible to
extend this idea to generalized additive models with binary response data or Poisson data,
or with a given link function. Research in this area is ongoing.

5. Proof of Theorems 

Proof of Theorem 1: As $\mu = E f_0(X)$, a natural consistent estimate of $\mu$ is
$\hat\mu = n^{-1} \sum_{i=1}^{n} Y_i$. Without loss of generality, assume that $\mu = 0$. By the condition (2.6), there
exist $f_{kn}^*$ such that

$$\|f_{kn}^* - f_k\|_n = O(n^{-1/(2+w)} c_n^{w/(2+w)}) \quad \text{and} \quad \mathcal{P}(f_{kn}^*) \le c_n, \quad \text{for } k = 1, \dots, K.$$
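
Define $f_n^* = \sum_{k=1}^{K} f_{kn}^*$, which satisfies

$$\|f - f_n^*\|_n = \Big\| \sum_{k=1}^{K} f_k - \sum_{k=1}^{K} f_{kn}^* \Big\|_n \le \sum_{k=1}^{K} \|f_k - f_{kn}^*\|_n = O(n^{-1/(2+w)} c_n^{w/(2+w)})$$

and

$$\mathcal{P}(f_n^*) = \sum_{k=1}^{K} \mathcal{P}(f_{kn}^*) \le K c_n.$$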

By the definition of $\hat f_n$, we have



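and

$$\|\hat f_n - f_n^*\|_n^2 \le \lambda_n (\mathcal{P}(f_n^*) - \mathcal{P}(\hat f_n)) + \frac{2}{n} \sum_{i=1}^{n} e_i (\hat f_n(X_i) - f_n^*(X_i)) + \frac{2}{n} \sum_{i=1}^{n} (f(X_i) - f_n^*(X_i)) (\hat f_n(X_i) - f_n^*(X_i))$$

$$\le \lambda_n (\mathcal{P}(f_n^*) - \mathcal{P}(\hat f_n)) + \frac{2}{n} \sum_{i=1}^{n} e_i (\hat f_n(X_i) - f_n^*(X_i)) + 2 \|f_n^* - f\|_n \, \|\hat f_n - f_n^*\|_n,$$
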

where the second inequality is derived from the Cauchy-Schwarz inequality. Note that the 
condition (2.7) on the entropy bound implies 



l^^^/^Er=ign g(x 



sup J " =Op{l). 



1 

Define g = [fn — fn)/{'P{fn) + Kcn)"^. As ^„ is a linear space, it is easy to see that g € 
and V{g) < 1. Then, by the entropy bound, we have 

1 ^ 1 
I- J]e,(/„(X,) - /:(X,))| < ||/„ - /:i|n ' (Vifn) + Kc|nOp(n-i/2)|, 

i=l 

which implies that 

ii/n-/:ii^ < A„(T'(/:)-7'(/„)) + ii/„,-/:ii^ ^v{fn)+Kcmop{n~'/')\ 

~^ ^ 1 1 / ~ /n 1 1 n 1 1 /ra ~ /n I U 

= Rji + 2 1 1 / — /n 1 1 n 1 1 /n ~ /n 1 1 ■ 

This inequality holds only when either one of the following inequalities is fulfilled:

$$\|\hat f_n - f_n^*\|_n^2 \le 2R_n, \quad \text{or}$$
$$\|\hat f_n - f_n^*\|_n \le 4\|f - f_n^*\|_n \quad \text{and} \quad R_n \le 2\|f - f_n^*\|_n \, \|\hat f_n - f_n^*\|_n.$$

To prove this, we consider each case separately.
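
The reason that one of the two alternatives must hold is elementary. As a worked step, write $A$ for the distance between the estimator and its spline approximant and $B$ for the approximation error; this shorthand is introduced here for illustration only. The preceding bound then has the form $A^2 \le R_n + 2AB$, and comparing $R_n$ with $2AB$ gives the two cases.

```latex
A^2 \le R_n + 2AB .
% If R_n > 2AB, the cross term is dominated by R_n:
A^2 < R_n + R_n = 2R_n .
% Otherwise R_n \le 2AB, and the bound becomes
A^2 \le 2AB + 2AB = 4AB \;\Longrightarrow\; A \le 4B ,
% where dividing by A is valid when A > 0, and A \le 4B is trivial when A = 0.
```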

(i) If $\|\hat f_n - f_n^*\|_n \le 4\|f - f_n^*\|_n$ and $R_n \le 2\|f - f_n^*\|_n \, \|\hat f_n - f_n^*\|_n$, then we have

||/n-/:||n = 0p(n-^/('+-)C/(2+'")), 

and then 

Wfn - fWn < ll/n - /*||n + 11/*" f\\n = Opln" ^(2+-) C-/(2+-) ) . 

By Rr, < 211/ - f*\\n\\fn " f*\\n and ||/„ - /||„ = Op(n-V{2+«;)c-/(2+-))^ it is clear that 

V{fn) < n5 . (n-i/(2+-)c-/(2+-))i+f = ci 

(ii) When ||/n — /rJ||^, < 2i?„, we consider the two cases separately. 
Case (ii)-l. Vifn) > '^'P{fn)- The foregoing inequality implies that 

0<||/.-/||^^<-A„n/n) + ||/n-/:i|n"^ (^-P(/)J \Op{n~'/')\. 

Then, 

V{fn) < Xn Wfn " /nlln^'"™' | Op (n-^IT-^ ) | . 

Substituting this into the preceding equation, and noting that V{f*) —V{fn) < 0, we obtain 

^ '"(^-™) „ 1 

ll/n - /*||^ < Wfn - fnWn ' A„ ||/„ - /*|U^'^-' |0,(n-^-^)|. 

Invoking the condition for A„, we have 

Wfn - f:Wn < Xn ^"^^ \Op{n-^)\ = Op(n-l/(2+-)c-/(2+-)). 

Again, using this in the foregoing equation, we have 

n/n)=0(c|). 

Case (ii)-2. V{fn) < 2P(/*). The above equation yields 

ll/n - /:||^ < 2A„(P(/:) - V{fn)) + ll/n - /*||n" ^ (P(/n) + ^(/:))'" I Op (n^^/^) I , 

which implies that either 

ll/n - /:||^ < 4A„|P(/:) - V{f*)\ < UKchn 

or 

Wfn - /nll^ < 2||/n " /^Hn" ^ (P(/n) + T^^*))"' I O.Cn" ^2) | . 

As $\mathcal{P}(\hat f_n) = O(c_n)$, both inequalities give

$$\|\hat f_n - f\|_n = O_p(n^{-1/(2+w)} c_n^{w/(2+w)}).$$

Following this equation and Proposition 1 in Stone (1985) and the definition of $\mathcal{P}(\hat f_n)$,
we have

$$\|\hat f_{nk} - f_k\|_n = O_p(n^{-1/(2+w)} c_n^{w/(2+w)}) \quad \text{and} \quad \mathcal{P}(\hat f_{nk}) = O(c_n) \quad \text{for } 1 \le k \le K.$$

This completes the proof. □

Proof of Proposition [1} It is easy to verify that the functions in =^n(l) are uniformly 
bounded. By applying the results for entropy bounds in Birman and Solomjak (1967), we
can easily obtain the result. □

Proof of Theorem 2: By Conditions 1 and 2, and the results for the spline approximation 
(see de Boor, 1978), when the number of initial knots is sufficiently large, mini</c<^pfc > 
n^, it is obvious that the condition (2.6) of Theorem 1, ||/fc,„-/fc||„ = 0(n-i/(2+«;)p^/(2+t«)^ 
and $\mathcal{P}(f_{k,n}) \le c_n$, $k = 1, \dots, K$, are satisfied when $c_n$ is a constant and $w = 1/p$. By Proposition 1,
the entropy condition (2.7) in Theorem 1 is also satisfied by the spline approximation
function X^^^ fk,n when w = 1/p. Then, letting = Cn~'^P/^^~^'^P\ it is easy to see that 
Theorem 2 is a corollary of Theorem 1. □ 